Abstract

Ye showed recently that the simplex method with Dantzig pivoting rule, as well as Howard's policy 
iteration algorithm, solve discounted Markov decision processes (MDPs), with a constant discount 
factor, in strongly polynomial time. More precisely, Ye showed that both algorithms terminate 
after at most $O(\frac{mn}{1-\gamma} \log(\frac{n}{1-\gamma}))$ iterations, where $n$ is the number of states, $m$ is the total number of
actions in the MDP, and $0 < \gamma < 1$ is the discount factor. We improve Ye's analysis in two respects.
First, we improve the bound given by Ye and show that Howard's policy iteration algorithm actually
terminates after at most $O(\frac{m}{1-\gamma} \log(\frac{n}{1-\gamma}))$ iterations. Second, and more importantly, we show that
the same bound applies to the number of iterations performed by the strategy iteration (or strategy 
improvement) algorithm, a generalization of Howard's policy iteration algorithm used for solving 
2-player turn-based stochastic games with discounted zero-sum rewards. This provides the first 
strongly polynomial algorithm for solving these games, resolving a long standing open problem. 

1 Introduction



Markov Decision Processes (MDPs) are widely used in operations research, machine learning and 
related disciplines, to model long-term sequential decision making in uncertain, i.e., stochastic,
environments. Stochastic Games (SGs), a generalization of MDPs to a 2-player setting, are widely used
to model long-term sequential decision making in stochastic and adversarial environments. MDPs
were first introduced by Bellman [2]. SGs, which form a more general model, were introduced slightly
earlier by Shapley [32]. Many variants of MDPs and SGs were studied in the literature. The MDPs 
and SGs considered in this paper are infinite-horizon discounted MDPs/SGs. The SGs we consider 
are turn-based and we thus refer to them as 2-player Turn-Based Stochastic Games (2TBSG). 

MDPs may be viewed as degenerate 2TBSGs in which one of the players has no influence on the 
game. For a thorough treatment of MDPs and their numerous practical applications, see the books 
of Howard [18], Derman [9], Puterman [29] and Bertsekas [3]. For a similar treatment of SGs, see the
books of Filar and Vrieze [13] and Neyman and Sorin [28].

A 2TBSG is composed of a finite set of states and a finite set of actions. Each state is controlled
by one of the players. In each time unit, the game is in exactly one of the states. Each state has 
a non-empty set of actions associated with it. The player controlling the state must play one of 
these actions. Playing an action incurs an immediate cost, and results in a probabilistic transition 
to a new state according to a probability distribution that depends on the action. The process goes 
on indefinitely. The first player tries to minimize the total expected discounted cost of the infinite 
sequence of actions taken, with respect to a fixed discount factor. The second player tries to maximize 
this total discounted cost. Discounting captures the fact that a cost incurred at a later stage has a 
smaller effect than the same cost incurred at an earlier stage. For formal definitions, see Section 2.

A policy or a strategy for a player is a possibly probabilistic rule that specifies the action to be taken 
in each situation, given the full history of play so far. One of the fundamental results in the theory 
of MDPs and 2TBSGs, is that both players have positional optimal strategies. A positional strategy 
is a strategy that is both deterministic and memoryless. A memoryless strategy is a strategy that 
depends only on the current state, and not on the full history. MDPs and 2TBSGs are solved by 
finding optimal positional strategies for the players. 

MDPs can be solved using linear programming (d'Epenoux [8], Derman [9]). The preferred way of
solving MDPs in practice, however, is Howard's [18] Policy Iteration algorithm. The policy iteration
algorithm maintains and iteratively improves a policy by performing "obvious" improving switches
(for details, see Section 5). Howard's algorithm may be viewed as a parallel version of the simplex
algorithm in which several pivoting steps are performed simultaneously. The problem of determining
the worst case complexity of Howard's algorithm was stated explicitly at least 25 years ago. (It is
mentioned, among other places, in Schmitz [31], Littman et al. [23] and Mansour and Singh [25].) Meister
and Holzbaur [27] established, decades ago, that the number of iterations performed by Howard's
algorithm, when the discount factor is fixed, is polynomially bounded in the bit size of the input. Their
bound, however, is not polynomial in the number of states and actions of the MDP. The first strongly
polynomial time algorithm for solving MDPs was an interior point algorithm of Ye [34].

Very recently, Ye [35] presented a surprisingly simple proof that Howard's algorithm terminates after
at most $O(\frac{mn}{1-\gamma} \log(\frac{n}{1-\gamma}))$ iterations, where $n$ is the number of states, $m$ is the total number of actions,
and $0 < \gamma < 1$ is the discount factor. In particular, when the discount factor is constant, the number of
iterations is $O(mn \log n)$. Since each iteration only involves solving a system of linear equations, Ye's
result established for the first time that Howard's algorithm is a strongly polynomial time algorithm, 
when the discount factor is constant. Ye's proof is based on a careful analysis of an LP formulation 
of the MDP problem, with LP duality and complementary slackness playing crucial roles. 

We significantly improve and extend Ye's [35] analysis. We show that Howard's algorithm actually
terminates after at most $O(\frac{m}{1-\gamma} \log(\frac{n}{1-\gamma}))$ iterations, improving Ye's bound by a factor of $n$.
Interestingly, the only added ingredient needed to obtain this significant improvement is a well-known
relationship between Howard's policy iteration algorithm and Bellman's [2] value iteration algorithm,
an algorithm for approximating the values of MDPs.

More significantly, and more surprisingly, we are able to obtain the same $O(\frac{m}{1-\gamma} \log(\frac{n}{1-\gamma}))$ bound
also for the Strategy Iteration (or Strategy Improvement) algorithm for the solution of 2TBSGs. This
supplies the first strongly polynomial algorithm for solving 2TBSGs, with a fixed discount factor,
solving a long-standing open problem.

The strategy iteration algorithm is a natural generalization of Howard's policy iteration algorithm 
that can be used to solve 2TBSGs. The strategy iteration algorithm for discounted 2-player games 
was apparently first described by Rao et al. [30]. Hoffman and Karp [17] earlier described a related
algorithm for a somewhat different class of SGs. 

Prior to our strongly polynomial bound for the strategy iteration algorithm, the best time bound available
for the problem of solving discounted 2TBSGs was a polynomial, but not strongly polynomial, bound
of Littman [22], obtained essentially using value iteration. The best time bound expressed solely in
terms of the number of states and actions was a subexponential bound of Ludwig [24]. (See also Björklund
and Vorobyov [4, 5] and Halman [16].) Interestingly, these subexponential bounds are obtained using
randomized variants of the strategy iteration algorithm that mimic the combinatorial subexponential 
algorithms of Kalai [20, 21] and Matoušek, Sharir and Welzl [26] for solving LP-type problems.

What makes our analysis of the strategy iteration algorithm surprising is the fact that Ye's analysis 
relies heavily on the LP formulation of MDPs. In contrast, no succinct LP formulation is known 
for 2TBSGs. (Natural attempts fail. See Condon [7].) Our proof is based on finding natural game- 
theoretic quantities that correspond to the LP-based quantities used by Ye, and on reestablishing, via
direct means, improved versions of the bounds obtained by Ye using LP duality.

Ye's [35] results and our results, combined with the recent results of Friedmann [14] and Fearnley [12],
supply a complete characterization of the complexity of the policy/strategy iteration algorithm for
MDPs/2TBSGs. The policy/strategy iteration algorithms are strongly polynomial for a fixed discount
factor, but exponential for non-discounted problems, or when the discount factor is part of the input.
(In non-discounted problems the discounted criterion is replaced by the limiting average criterion. In a
sense, this is equivalent to letting the discount factor tend to 1. See, e.g., Derman [9].)

The rest of this paper is organized as follows. In Section 2 we define the 2-player turn-based stochastic
games (2TBSGs) studied in this paper. In Sections 3, 4 and 5 we summarize known results regarding
these games. For completeness, these sections contain concise, but complete, proofs of all results.
(The proofs in these three sections are not the innovative part of this paper and may be skipped
at first reading.) Finally, in Section 6 we obtain our innovative strongly polynomial bound on the
complexity of the celebrated strategy iteration algorithm, solving a long-standing open problem. We
end in Section 7 with some concluding remarks and open problems.

2 2-player turn-based stochastic games 

Discounted stochastic games were first studied by Shapley [32]. In his games, the players perform
simultaneous, or concurrent, actions. We consider the subclass of turn-based stochastic games. 

We briefly review the informal definition of 2-Player Turn-Based Stochastic Games (2TBSGs), before 
giving a formal definition. A game is composed of states and actions. It starts at some initial state 
and proceeds, in discrete steps, indefinitely. In each time step one of the players plays an action. (The 
game is thus a turn-based or perfect information game.) Each action has a cost associated with it. 
This is the cost paid by player 1 to player 2 when this action is played. (The game is therefore a 
zero-sum game.) Each action also has a probability distribution on states associated with it. The next 
state, after playing a particular action, is chosen randomly according to this probability distribution. 
(The game is, in general, stochastic.) Finally, the game is discounted. The first player tries to minimize 
the expected total discounted cost, while the second player tries to maximize it. 

Definition 2.1 (Actions). An action $a$ over a set of states $S$ is composed of a triplet $(s(a), p(a), c(a))$,
where $s(a) \in S$ is the state from which $a$ can be played, $p(a) \in \Delta(S)$ is a probability distribution over
states according to which the next state is chosen when $a$ is played, and $c(a) \in \mathbb{R}$ is the cost of $a$.

Definition 2.2 (2-Player Turn-Based Stochastic Games). A 2-Player Turn-Based (Discounted)
Stochastic Game (2TBSG) is a tuple $G = (S_1, S_2, A, \gamma)$, where $S_1$ and $S_2$ are the sets of states controlled
by players 1 and 2, respectively, and $A$ is a set of actions. We assume that $S_1 \cap S_2 = \emptyset$ and let
$S = S_1 \cup S_2$. For every $i \in S$, we let $A_i = \{a \in A \mid s(a) = i\}$ be the set of actions that can be played
from $i$. We assume that $A_i \neq \emptyset$, for every $i \in S$. We let $A^1 = \bigcup_{i \in S_1} A_i$ and $A^2 = \bigcup_{i \in S_2} A_i$ be the
sets of all actions that can be played by players 1 and 2, respectively. Finally, $0 < \gamma < 1$ is a fixed
discount factor. If the infinite sequence of actions taken by the two players is $a_0, a_1, \ldots$, then the total
discounted cost of this action sequence is $\sum_{k \ge 0} \gamma^k c(a_k)$.

If one of the players has only a single action available from each state under her control, the game 
degenerates into a 1-player game known as a Markov Decision Process. (This happens, in particular, 
when $S_1 = \emptyset$ or $S_2 = \emptyset$.)

We next define the probability and action matrices of 2TBSGs. These matrices provide a compact 
representation of 2TBSGs that greatly simplifies their manipulation. Throughout the paper, we use 
$n = |S|$ and $m = |A|$ to denote the number of states and actions, respectively, in a game.

Definition 2.3 (Probability and action matrices). Let $G = (S_1, S_2, A, \gamma)$ be a 2TBSG. We
assume, without loss of generality, that $S = S_1 \cup S_2 = [n]$ and $A = [m]$. We let $P \in \mathbb{R}^{m \times n}$, where
$P_{a,i} = (p(a))_i$ is the probability of ending up in state $i$ after taking action $a$, for every $a \in A = [m]$ and
$i \in S = [n]$, be the probability matrix of the game, and $c \in \mathbb{R}^m$, where $c_a = c(a)$ is the cost of action
$a \in A = [m]$, be its cost vector. We also let $J \in \mathbb{R}^{m \times n}$ be a matrix such that $J_{a,i} = 1$ if and only if
$a \in A_i$, and $J_{a,i} = 0$ otherwise. Finally, we let $Q = J - \gamma P$ be the action matrix of $G$.

It is interesting to note that a 2TBSG is fully specified by its action matrix $Q = J - \gamma P$, its cost
vector $c$, and the partition of $S = [n]$ into $S_1$ and $S_2$. (Action matrices may be thought of as a
stochastic and discounted generalization of the incidence matrices of directed graphs.)
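
As a concrete illustration of Definition 2.3, the following sketch builds $P$, $c$, $J$ and $Q = J - \gamma P$ for a small game. The game itself (two states, four actions, $\gamma = 1/2$), the 0-based indexing and all identifiers are assumptions made for this example only; they do not come from the paper.

```python
from fractions import Fraction as F

GAMMA = F(1, 2)  # assumed discount factor for the toy example

# Hypothetical toy game: each action is (state s(a), cost c(a), distribution p(a)).
# State 0 belongs to player 1 (minimizer), state 1 to player 2 (maximizer).
ACTIONS = [
    (0, F(3), [F(1), F(0)]),  # a0: stay in state 0, cost 3
    (0, F(0), [F(0), F(1)]),  # a1: move to state 1, cost 0
    (1, F(2), [F(0), F(1)]),  # a2: stay in state 1, cost 2
    (1, F(0), [F(1), F(0)]),  # a3: move to state 0, cost 0
]
N = 2  # number of states


def build_matrices(actions, n, gamma):
    """Return (P, c, J, Q) as lists of rows, with Q = J - gamma * P."""
    P = [list(p) for (_, _, p) in actions]
    c = [cost for (_, cost, _) in actions]
    # J[a][i] = 1 exactly when action a is played from state i.
    J = [[F(1) if s == i else F(0) for i in range(n)] for (s, _, _) in actions]
    Q = [[J[a][i] - gamma * P[a][i] for i in range(n)] for a in range(len(actions))]
    return P, c, J, Q


P, c, J, Q = build_matrices(ACTIONS, N, GAMMA)
for a, row in enumerate(Q):
    print(f"Q[a{a}] =", [str(x) for x in row])
```

Each row of $Q$ sums to $1 - \gamma$, since the rows of $J$ and of $P$ each sum to 1.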

Definition 2.4 (Strategies, strategy profiles). A (positional) strategy $\pi_j$ for player $j$ is a mapping
$\pi_j : S_j \to A$ such that $\pi_j(i) \in A_i$, for every $i \in S_j$. We say that player $j$ uses strategy $\pi_j$ if whenever
the game is in state $i$, player $j$ chooses action $\pi_j(i)$. A strategy profile $\pi = (\pi_1, \pi_2)$ is simply a pair
of strategies for the two players. We let $\Pi_j = \Pi_j(G)$, for $j \in \{1, 2\}$, be the set of all strategies of
player $j$, and let $\Pi = \Pi(G) = \Pi_1 \times \Pi_2$ be the set of all strategy profiles in $G$.

We note that a strategy profile $\pi = (\pi_1, \pi_2)$ may be viewed as a mapping $\pi : S \to A$, i.e., as a
strategy in a 1-player version of the game. All strategies considered in this paper are positional.
When convenient, we also view a strategy $\pi_j$ or a strategy profile $\pi$ as subsets $\pi_j(S_j), \pi(S) \subseteq A$. A
strategy profile $\pi = (\pi_1, \pi_2)$, when viewed as a subset of $A$, is simply the union $\pi_1 \cup \pi_2$. We let
$P_\pi \in \mathbb{R}^{n \times n}$ be the matrix obtained by selecting the rows of $P$ whose indices belong to $\pi$. Note that $P_\pi$
is a (row) stochastic matrix. Its elements are non-negative and the elements in each row sum to 1.
Similarly, $c_\pi \in \mathbb{R}^n$ is the vector containing the costs of the actions that belong to $\pi$. We conveniently
have $J_\pi = I$ and $Q_\pi = I - \gamma P_\pi$, for every strategy profile $\pi$.

Definition 2.5 (Value vectors). For every strategy profile $\pi = (\pi_1, \pi_2) \in \Pi$, we let $v_\pi = v_{\pi_1,\pi_2} \in \mathbb{R}^n$
be a vector such that $(v_\pi)_i$, for every $i \in S$, is the expected total discounted cost when the game starts
at state $i$, player 1 uses strategy $\pi_1$, and player 2 uses strategy $\pi_2$.

Given two vectors $u, v \in \mathbb{R}^n$, we say that $u \le v$ if and only if $u_i \le v_i$, for every $1 \le i \le n$. We say
that $u < v$ if and only if $u \le v$ and $u \neq v$.

Definition 2.6 (Optimal counter strategies). Let $G$ be a 2TBSG and let $\pi_2 \in \Pi_2(G)$ be a strategy
of player 2. A strategy $\pi_1$ for player 1 is said to be an optimal counter-strategy against $\pi_2$, if and
only if $v_{\pi_1,\pi_2} \le v_{\pi_1',\pi_2}$, for every $\pi_1' \in \Pi_1(G)$. Similarly, a strategy $\pi_2$ for player 2 is said to be
an optimal counter-strategy against $\pi_1$, if and only if $v_{\pi_1,\pi_2} \ge v_{\pi_1,\pi_2'}$, for every $\pi_2' \in \Pi_2(G)$. For
every $\pi_1 \in \Pi_1(G)$, we let $\tau_2(\pi_1)$ be an optimal counter strategy against $\pi_1$, if one exists. For every
$\pi_2 \in \Pi_2(G)$, we let $\tau_1(\pi_2)$ be an optimal counter strategy against $\pi_2$, if one exists.

It is not immediately clear that optimal counter strategies always exist. (Note that $v_{\pi_1,\pi_2} \le v_{\pi_1',\pi_2}$
and $v_{\pi_1,\pi_2} \ge v_{\pi_1,\pi_2'}$ are vector inequalities. As defined, optimal counter strategies need to be optimal
for every initial state.) Furthermore, optimal counter strategies, if they exist, need not be unique. It
is well known, however, that optimal counter strategies do always exist, as we shall also show below.

In a two-player zero-sum game, an optimal strategy is by definition one that secures the best possible 
guarantee on the expected payoff against any opponent. As with finite games, pairs of optimal strategies
in a zero-sum stochastic game coincide with the Nash equilibria of the game. This was established
by Shapley [32]. For brevity, we take this characterization to be the definition of an optimal strategy. 

Definition 2.7 (Optimal strategies). A strategy profile $\pi = (\pi_1, \pi_2) \in \Pi(G)$ is said to be optimal
if and only if $\pi_1$ is an optimal counter strategy against $\pi_2$, and $\pi_2$ is an optimal counter strategy
against $\pi_1$. In such a case we also say that $\pi_1$ is an optimal strategy for player 1 and that $\pi_2$ is an
optimal strategy for player 2.

Shapley [32] also established the following theorem.

Theorem 2.8. Every 2TBSG has an optimal strategy profile. If $\pi$ and $\pi'$ are two optimal strategy
profiles then $v_\pi = v_{\pi'}$.

Theorem 2.8 immediately implies the existence of optimal counter strategies against any strategy. It
is easy to see that $\pi_1$ is an optimal strategy for player 1 if and only if $v_{\pi_1,\tau_2(\pi_1)} \le v_{\pi_1',\tau_2(\pi_1')}$, for every
$\pi_1' \in \Pi_1$. An analogous condition clearly holds for player 2. The main result of this paper is a proof
that a pair of optimal strategies can be computed in strongly polynomial time, when the discount
factor is constant.



3 Basic results 

For any strategy profile $\pi$, the matrix $(I - \gamma P_\pi)$ plays a prominent role in the sequel. (Recall that $P_\pi$
is the matrix obtained by selecting the rows of $P$ that correspond to actions that belong to $\pi$.) We
thus start with the following lemma whose trivial proof is omitted.

Lemma 3.1. For any strategy profile $\pi$, the matrix $(I - \gamma P_\pi)$ is invertible and

$$(I - \gamma P_\pi)^{-1} = \sum_{k \ge 0} (\gamma P_\pi)^k.$$

All entries of $(I - \gamma P_\pi)^{-1}$ are non-negative and the entries on the diagonal are strictly positive.

Lemma 3.2. For every strategy profile $\pi \in \Pi$ and every $0 < \gamma < 1$, we have

$$v_\pi = (I - \gamma P_\pi)^{-1} c_\pi.$$

Proof. When the players use the strategy profile $\pi$, the process becomes a Markov chain with rewards
with transition matrix $P_\pi$. In particular, for every $i, j \in [n]$ and every $k \ge 0$, $(P_\pi^k)_{i,j}$ is the probability
that a game that starts at state $i$ is in state $j$ after exactly $k$ steps. The expected total discounted
costs, starting from all states, are thus

$$v_\pi = \Big( \sum_{k \ge 0} (\gamma P_\pi)^k \Big) c_\pi = (I - \gamma P_\pi)^{-1} c_\pi.$$ □
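
Lemma 3.2 says that evaluating a strategy profile amounts to solving one linear system. The sketch below does this in exact rational arithmetic and compares the result with a truncated version of the series from Lemma 3.1. The toy game, the 0-based indexing and the helper names (`solve`, `value_vector`, `value_by_series`) are assumptions introduced for illustration.

```python
from fractions import Fraction as F

GAMMA = F(1, 2)  # assumed discount factor
# Hypothetical toy game: action = (state, cost, transition distribution).
ACTIONS = [
    (0, F(3), [F(1), F(0)]),
    (0, F(0), [F(0), F(1)]),
    (1, F(2), [F(0), F(1)]),
    (1, F(0), [F(1), F(0)]),
]
N = 2


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination over exact fractions."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        d = M[col][col]
        M[col] = [x / d for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def value_vector(profile):
    """v_pi = (I - gamma P_pi)^{-1} c_pi; profile[i] is the action played in state i."""
    A = [[(1 if i == j else 0) - GAMMA * ACTIONS[profile[i]][2][j] for j in range(N)]
         for i in range(N)]
    c_pi = [ACTIONS[profile[i]][1] for i in range(N)]
    return solve(A, c_pi)


def value_by_series(profile, terms):
    """Truncated series: sum over k < terms of (gamma P_pi)^k c_pi."""
    total = [F(0)] * N
    term = [ACTIONS[profile[i]][1] for i in range(N)]
    for _ in range(terms):
        total = [t + x for t, x in zip(total, term)]
        term = [GAMMA * sum(ACTIONS[profile[i]][2][j] * term[j] for j in range(N))
                for i in range(N)]
    return total


print(value_vector([0, 2]))
print(value_vector([1, 2]))
```

The truncated series approaches the exact solution geometrically, in line with the factor $\gamma^k$ in Lemma 3.1.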

Definition 3.3 (Modified costs). The modified cost vector $c^\pi \in \mathbb{R}^m$ corresponding to a strategy
profile $\pi$ is defined to be

$$c^\pi = c - (J - \gamma P) v_\pi.$$

The modified cost vector $c^\pi$ is obtained from $c$ via a potential transformation that uses $v_\pi$ as a vector
of potentials. (If $h : S \to \mathbb{R}$ is a function assigning potentials to the states, then the modified cost $c_h(a)$
is defined as $c_h(a) = c(a) - h(s(a)) + \gamma \sum_{j \in S} p(a)_j h(j)$.)

It is important to stress the difference between $c_\pi \in \mathbb{R}^n$, the vector obtained by selecting the entries
of $c$ corresponding to strategy profile $\pi$, and the modified cost vector $c^\pi = c - (J - \gamma P) v_\pi \in \mathbb{R}^m$ of
Definition 3.3. (This distinction may be confusing at first, but it is extremely useful.)

We let $0$ be an all zero vector. (The dimension of $0$ will depend on the context.) Using Lemma 3.2
we immediately get the following basic but important relation.

Lemma 3.4. For every strategy profile $\pi$ we have $(c^\pi)_\pi = 0$.

Definition 3.5 (Modified value vectors). For every two strategy profiles $\pi, \pi'$, we let $v^\pi_{\pi'}$ be the
value vector of $\pi'$ corresponding to the modified cost vector $c^\pi$.

Lemma 3.6. For every two strategy profiles $\pi', \pi$ we have

$$v^\pi_{\pi'} = v_{\pi'} - v_\pi.$$

Proof. By Definition 3.3 and Lemma 3.2 we have

$$v^\pi_{\pi'} = (I - \gamma P_{\pi'})^{-1} (c^\pi)_{\pi'} = (I - \gamma P_{\pi'})^{-1} \big( c_{\pi'} - (I - \gamma P_{\pi'}) v_\pi \big) = v_{\pi'} - v_\pi.$$ □


Recall that $A^1 = \bigcup_{i \in S_1} A_i$ and $A^2 = \bigcup_{i \in S_2} A_i$.

Lemma 3.7 (Optimality condition). A strategy profile $\pi$ is optimal iff $(c^\pi)_{A^1} \ge 0$ and $(c^\pi)_{A^2} \le 0$.

Proof. Suppose that $(c^\pi)_{A^1} \ge 0$ and $(c^\pi)_{A^2} \le 0$. Let $\pi = (\pi_1, \pi_2)$. We prove that $\pi_1$ is an optimal
counter strategy against $\pi_2$. By Lemma 3.4 we have $(c^\pi)_{\pi_1} = 0$, $(c^\pi)_{\pi_2} = 0$ and hence $v^\pi_{\pi_1,\pi_2} = 0$.
For every $\pi_1' \in \Pi_1$, we have $(c^\pi)_{\pi_1'} \ge 0$, as $\pi_1' \subseteq A^1$, and hence $(c^\pi)_{\pi_1',\pi_2} \ge 0$. Thus clearly
$v^\pi_{\pi_1',\pi_2} \ge 0 = v^\pi_{\pi_1,\pi_2}$, which by Lemma 3.6 means that $v_{\pi_1',\pi_2} \ge v_{\pi_1,\pi_2}$, and $\pi_1$ is indeed an optimal
counter strategy against $\pi_2$. The proof that $\pi_2$ is an optimal counter strategy against $\pi_1$ is analogous.

Suppose now that there is an action $a \in A_{i_0}$, where $i_0 \in S_1$, such that $(c^\pi)_a < 0$. (The case in which
$i_0 \in S_2$ and $(c^\pi)_a > 0$ is analogous.) Again, let $\pi = (\pi_1, \pi_2)$. Let $\pi_1' \in \Pi_1$ be a policy such that
$\pi_1'(i) = \pi_1(i)$, if $i \neq i_0$, and $\pi_1'(i_0) = a$. We then have $(c^\pi)_{\pi_1'} < 0$ and $(c^\pi)_{\pi_2} = 0$. Thus $v^\pi_{\pi_1',\pi_2} < 0$.
(The strict inequality follows from Lemma 3.1. All entries of $(I - \gamma P_{\pi_1',\pi_2})^{-1}$ are non-negative, and the
entries on the diagonal are strictly positive.) Thus $\pi_1$ is not an optimal counter strategy against $\pi_2$. □

In the second part of the proof above, $\pi_1'$ is obtained from $\pi_1$ by a profitable switch. Profitable switches
are closely related to the pivoting steps performed by the simplex algorithm. They also lie at the core 
of the strategy iteration algorithm whose analysis is the main focus of this paper. 
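
The modified costs and the optimality condition of Lemma 3.7 can be checked mechanically. The following sketch does so for a small game; the game, the 0-based indexing, the `OWNER` map and the two value vectors (obtained by solving $(I - \gamma P_\pi) v = c_\pi$ by hand) are all assumptions made for this illustration.

```python
from fractions import Fraction as F

GAMMA = F(1, 2)  # assumed discount factor
# Hypothetical toy game: action = (state, cost, transition distribution).
ACTIONS = [
    (0, F(3), [F(1), F(0)]),  # a0
    (0, F(0), [F(0), F(1)]),  # a1
    (1, F(2), [F(0), F(1)]),  # a2
    (1, F(0), [F(1), F(0)]),  # a3
]
OWNER = {0: 1, 1: 2}  # state -> controlling player (1 minimizes, 2 maximizes)


def modified_costs(v):
    """c^pi = c - (J - gamma P) v, where v is the value vector v_pi."""
    return [cost - (v[s] - GAMMA * sum(p_j * v_j for p_j, v_j in zip(p, v)))
            for (s, cost, p) in ACTIONS]


def profitable_switches(v):
    """Actions that violate the optimality condition of Lemma 3.7 for v = v_pi."""
    cm = modified_costs(v)
    return [a for a, (s, _, _) in enumerate(ACTIONS)
            if (OWNER[s] == 1 and cm[a] < 0) or (OWNER[s] == 2 and cm[a] > 0)]


# Value vectors of two profiles, computed by hand for this toy game.
v_bad = [F(6), F(4)]  # profile (a0, a2)
v_opt = [F(2), F(4)]  # profile (a1, a2)
print(modified_costs(v_bad), profitable_switches(v_bad))
print(modified_costs(v_opt), profitable_switches(v_opt))
```

For the first profile the modified cost of `a1` is negative, so switching state 0 to `a1` is profitable for player 1; for the second profile no action violates the condition. In both cases the actions used by the profile have modified cost 0, as Lemma 3.4 states.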

Definition 3.8 (Flux vectors). For every strategy profile $\pi$, let $x_\pi \in \mathbb{R}^{1 \times n}$ be a row vector such
that $(x_\pi)_i$, for every $i \in S$, is the sum of the discounted costs, over all states, when the cost of
action $\pi(i)$ is 1, while the cost of all other actions is 0, and when the players use strategy profile $\pi$.

We let $e = (1, 1, \ldots, 1)^T \in \mathbb{R}^n$ be an all one vector. Using Lemma 3.2, we easily get

Lemma 3.9. For every strategy profile $\pi$, we have

$$x_\pi = e^T (I - \gamma P_\pi)^{-1}.$$

It is in fact possible to view Lemma 3.9 as the definition of $x_\pi$. The meaning of the flux vectors given
in Definition 3.8 is not used in the sequel. (The flux vectors are intimately related to the dual linear
program formulation of MDPs.)

Lemma 3.10. For every strategy profile $\pi$, we have

$$x_\pi e = \frac{n}{1-\gamma}.$$

Proof. By Lemma 3.9, Lemma 3.1, and the fact that $e^T (P_\pi)^k e = n$, for every $k \ge 0$, we have:

$$x_\pi e = e^T (I - \gamma P_\pi)^{-1} e = \sum_{k \ge 0} e^T (\gamma P_\pi)^k e = n \sum_{k \ge 0} \gamma^k = \frac{n}{1-\gamma}.$$ □

Lemma 3.11. For every strategy profile $\pi$, we have

$$e^T v_\pi = x_\pi c_\pi.$$

Proof. By Lemma 3.2 and then Lemma 3.9 we get $e^T v_\pi = e^T (I - \gamma P_\pi)^{-1} c_\pi = x_\pi c_\pi$. □

Lemma 3.12. For every two strategy profiles $\pi, \pi'$, we have

$$e^T (v_{\pi'} - v_\pi) = x_{\pi'} (c^\pi)_{\pi'}.$$

Proof. By Lemma 3.6 and then Lemma 3.11 we have $e^T (v_{\pi'} - v_\pi) = e^T v^\pi_{\pi'} = x_{\pi'} (c^\pi)_{\pi'}$. □
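
To make the flux vector identities concrete, here is a small worked instance. It assumes a hypothetical two-state profile $\pi$ with $\gamma = 1/2$; the matrices below are illustrative and are not taken from the paper.

```latex
% Assumed toy data: n = 2, \gamma = 1/2.
P_\pi = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \qquad
c_\pi = \begin{pmatrix} 0 \\ 2 \end{pmatrix}, \qquad
(I - \gamma P_\pi)^{-1}
  = \begin{pmatrix} 1 & -\tfrac12 \\ 0 & \tfrac12 \end{pmatrix}^{-1}
  = \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}.

% Lemma 3.2 and Lemma 3.9:
v_\pi = (I - \gamma P_\pi)^{-1} c_\pi = \begin{pmatrix} 2 \\ 4 \end{pmatrix}, \qquad
x_\pi = e^T (I - \gamma P_\pi)^{-1} = \begin{pmatrix} 1 & 3 \end{pmatrix}.

% Lemma 3.10 and Lemma 3.11:
x_\pi e = 1 + 3 = 4 = \frac{n}{1-\gamma}, \qquad
e^T v_\pi = 2 + 4 = 6 = 1 \cdot 0 + 3 \cdot 2 = x_\pi c_\pi.
```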

4 Value iteration 

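If $x \in \mathbb{R}^m$ and $B \subseteq [m]$, we let $\min_B x = \min_{j \in B} x_j$, and similarly $\max_B x = \max_{j \in B} x_j$. We also let
$\operatorname{argmin}_B x = \operatorname{argmin}_{j \in B} x_j$ and $\operatorname{argmax}_B x = \operatorname{argmax}_{j \in B} x_j$.

Definition 4.1 (Value iteration operator). The value iteration operator $T : \mathbb{R}^n \to \mathbb{R}^n$ is defined
as follows:

$$(Tv)_i = \begin{cases} \min_{A_i} (c + \gamma P v), & \text{if } i \in S_1, \\ \max_{A_i} (c + \gamma P v), & \text{if } i \in S_2. \end{cases}$$

The operator $T$ is a contraction with Lipschitz constant $\gamma$.

Lemma 4.2. For every $u, v \in \mathbb{R}^n$ we have $\|Tu - Tv\|_\infty \le \gamma \|u - v\|_\infty$.

Proof. Assume that $i \in S_1$ and that $(Tu)_i \ge (Tv)_i$. (The other cases are analogous.) Let
$a = \operatorname{argmin}_{A_i} (c + \gamma P u)$ and $b = \operatorname{argmin}_{A_i} (c + \gamma P v)$. Then,

$$\begin{aligned}
(Tu - Tv)_i &= (c_a + \gamma P_a u) - (c_b + \gamma P_b v) \\
&\le (c_b + \gamma P_b u) - (c_b + \gamma P_b v) \\
&= \gamma P_b (u - v) \\
&\le \gamma \|u - v\|_\infty.
\end{aligned}$$

The last inequality follows from the fact that the elements in $P_b$ are non-negative and sum up to 1. □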
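
The operator $T$ and the contraction property of Lemma 4.2 are easy to experiment with. The sketch below implements $T$ for a small game and checks the inequality on one pair of vectors; the game, the 0-based indexing and the helper names are assumptions made for this example.

```python
from fractions import Fraction as F

GAMMA = F(1, 2)  # assumed discount factor
# Hypothetical toy game: action = (state, cost, transition distribution).
ACTIONS = [
    (0, F(3), [F(1), F(0)]),
    (0, F(0), [F(0), F(1)]),
    (1, F(2), [F(0), F(1)]),
    (1, F(0), [F(1), F(0)]),
]
OWNER = {0: 1, 1: 2}  # state -> controlling player (1 minimizes, 2 maximizes)
N = 2


def q_value(a, v):
    """(c + gamma P v)_a for action index a."""
    _, cost, p = ACTIONS[a]
    return cost + GAMMA * sum(p_j * v_j for p_j, v_j in zip(p, v))


def T(v):
    """Value iteration operator: min over A_i for player 1, max over A_i for player 2."""
    out = []
    for i in range(N):
        vals = [q_value(a, v) for a, (s, _, _) in enumerate(ACTIONS) if s == i]
        out.append(min(vals) if OWNER[i] == 1 else max(vals))
    return out


def sup_norm(u, v):
    """Infinity-norm distance between two vectors."""
    return max(abs(x - y) for x, y in zip(u, v))


u, v = [F(0), F(0)], [F(5), F(-3)]
print(sup_norm(T(u), T(v)), "<=", GAMMA * sup_norm(u, v))
```

In this toy game the vector $(2, 4)$ is mapped to itself by `T`.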
The Banach fixed point theorem now implies that the value iteration operator $T$ has a unique fixed point.

Corollary 4.3. There is a unique vector $v^* \in \mathbb{R}^n$ such that $Tv^* = v^*$.

We next define the strategy extraction operators that play an important role in this section, and the
central role in the next section.

Definition 4.4 (Strategy extraction operators). The strategy extraction operators $\mathcal{P}_1 : \mathbb{R}^n \to \Pi_1$,
$\mathcal{P}_2 : \mathbb{R}^n \to \Pi_2$ and $\mathcal{P} : \mathbb{R}^n \to \Pi$ are defined as follows:

$$(\mathcal{P}_1 v)(i) = \operatorname{argmin}_{A_i} (c + \gamma P v), \quad i \in S_1,$$
$$(\mathcal{P}_2 v)(i) = \operatorname{argmax}_{A_i} (c + \gamma P v), \quad i \in S_2,$$

and

$$\mathcal{P} v = (\mathcal{P}_1 v, \mathcal{P}_2 v).$$

The following relation between the value iteration and strategy extraction operators is immediate.

Lemma 4.5. For every $v \in \mathbb{R}^n$ we have $Tv = (c + \gamma P v)_\pi$, where $\pi = \mathcal{P} v$.

The following simple lemma provides an interesting relation between the strategy extraction operator
and modified cost vectors.

Lemma 4.6. For every strategy profile $\pi$ we have

$$(\mathcal{P}_1 v_\pi)(i) = \operatorname{argmin}_{A_i} c^\pi, \quad i \in S_1,$$
$$(\mathcal{P}_2 v_\pi)(i) = \operatorname{argmax}_{A_i} c^\pi, \quad i \in S_2.$$

Proof. Let $v = v_\pi$. If $a \in A_i$ then

$$(c^\pi)_a = c_a - (v_i - \gamma P_a v) = (c + \gamma P v)_a - v_i.$$

Thus, if $a, a' \in A_i$, then $(c + \gamma P v)_a \le (c + \gamma P v)_{a'}$ if and only if $(c^\pi)_a \le (c^\pi)_{a'}$. □

The following lemma supplies a simple proof of Theorem 2.8. (This is, in fact, the original proof given
by Shapley [32].)

Lemma 4.7. Let $v^* \in \mathbb{R}^n$ be the unique fixed point of $T$ and let $\pi = \mathcal{P} v^*$. Then, $\pi$ is an optimal
strategy profile.

Proof. By Lemma 4.5 we get that $v^* = Tv^* = c_\pi + \gamma P_\pi v^*$. By Lemma 3.2 we get $v_\pi = v^*$. We next
show that $\pi$ satisfies the optimality condition of Lemma 3.7 and hence is an optimal strategy profile.
Suppose that $i \in S_1$ and that $a \in A_i$. By Lemma 4.6 we have $\pi(i) = (\mathcal{P}_1 v^*)(i) = \operatorname{argmin}_{A_i} c^\pi$. As
$(c^\pi)_{\pi(i)} = 0$, we get that $(c^\pi)_a \ge 0$. Similarly, if $i \in S_2$ and $a \in A_i$, we get that $(c^\pi)_a \le 0$. □



The value iteration algorithm, given on the left-hand side of Figure 1, repeatedly applies the value
iteration operator $T$ to an initial vector $u^0 \in \mathbb{R}^n$, generating a sequence of vectors $(u^k)_{k=0}^N$, where
$u^{k+1} = T u^k$, until the difference between two successive vectors is small enough, i.e., $\|u^{k-1} - u^k\|_\infty < \epsilon$.

Lemma 4.8. Let $(u^k)_{k=0}^N$ be the sequence of value vectors generated by a call Value-Iteration($u^0$, $\epsilon$),
for some $\epsilon > 0$. Let $v^*$ be the optimal value vector. Then, for every $0 \le k \le N$ we have

$$\|u^k - v^*\|_\infty \le \gamma^k \|u^0 - v^*\|_\infty.$$

Proof. By Lemma 4.2 and the fact that $Tv^* = v^*$, we have

$$\|u^k - v^*\|_\infty = \|T u^{k-1} - T v^*\|_\infty \le \gamma \|u^{k-1} - v^*\|_\infty.$$

The claim follows easily by induction. □

It follows immediately from Lemma 4.8 that for any $u^0 \in \mathbb{R}^n$, the infinite sequence of vectors generated
by the call Value-Iteration($u^0$, 0) converges to the optimal value vector $v^*$. Also, for every $\epsilon > 0$,
the call Value-Iteration($u^0$, $\epsilon$) eventually terminates.
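
A minimal sketch of the Value-Iteration procedure, together with a check of the error bound of Lemma 4.8, might look as follows. The toy game, the 0-based indexing, the stopping threshold and the known optimal value vector $(2, 4)$ of that game are assumptions of the example.

```python
from fractions import Fraction as F

GAMMA = F(1, 2)  # assumed discount factor
# Hypothetical toy game: action = (state, cost, transition distribution).
ACTIONS = [
    (0, F(3), [F(1), F(0)]),
    (0, F(0), [F(0), F(1)]),
    (1, F(2), [F(0), F(1)]),
    (1, F(0), [F(1), F(0)]),
]
OWNER = {0: 1, 1: 2}  # state -> controlling player (1 minimizes, 2 maximizes)
N = 2


def T(v):
    """Value iteration operator of Definition 4.1."""
    out = []
    for i in range(N):
        vals = [cost + GAMMA * sum(p_j * v_j for p_j, v_j in zip(p, v))
                for (s, cost, p) in ACTIONS if s == i]
        out.append(min(vals) if OWNER[i] == 1 else max(vals))
    return out


def sup_norm(u, v):
    return max(abs(x - y) for x, y in zip(u, v))


def value_iteration(u0, eps):
    """Iterate u^k = T u^{k-1} until ||u^{k-1} - u^k||_inf < eps; return all iterates."""
    iterates = [list(u0)]
    while True:
        iterates.append(T(iterates[-1]))
        if sup_norm(iterates[-2], iterates[-1]) < eps:
            return iterates


v_star = [F(2), F(4)]  # optimal values of the toy game (a fixed point of T)
iterates = value_iteration([F(0), F(0)], F(1, 100))
for k, u in enumerate(iterates):
    # Lemma 4.8: the error after k steps is at most gamma^k times the initial error.
    print(k, sup_norm(u, v_star), "<=", GAMMA ** k * sup_norm(iterates[0], v_star))
```

Note that `eps` must be positive here: with `eps = 0` the loop in this sketch would not terminate on this game.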

Function Value-Iteration($u^0$, $\epsilon$)
    $k \leftarrow 0$
    repeat
        $k \leftarrow k + 1$
        $u^k \leftarrow T u^{k-1}$
    until $\|u^{k-1} - u^k\|_\infty < \epsilon$
    return $u^k$

Function Strategy-Iteration($\sigma^0$)
    $k \leftarrow 0$
    repeat
        $\tau^k \leftarrow \tau_2(\sigma^k)$
        $v^k \leftarrow v_{\sigma^k, \tau^k}$
        $\sigma^{k+1} \leftarrow \mathcal{P}_1 v^k$ (ties broken in favor of $\sigma^k$ where possible)
        $k \leftarrow k + 1$
    until $\sigma^{k-1} = \sigma^k$
    return $\sigma^k$

Figure 1: The Value-Iteration and Strategy-Iteration algorithms.



5 Strategy iteration 

The strategy iteration algorithm is given on the right-hand side of Figure 1. It was first described for
the MDP case by Howard [18] and is called policy iteration or Howard's algorithm in that context. It
was described for 2-player stochastic games by Rao et al. [30]. (Their algorithm actually works on
more general imperfect information games for which it is a non-terminating approximation algorithm.)

The strategy iteration algorithm receives an initial strategy $\sigma^0$ of player 1, and generates a sequence
$\pi^k = (\sigma^k, \tau^k)$ of strategy profiles of the two players, ending with an optimal strategy profile. Each
iteration of the algorithm receives a strategy $\sigma^k$ and produces an improved strategy $\sigma^{k+1}$ as follows.
The algorithm first computes an optimal counter-strategy $\tau^k = \tau_2(\sigma^k)$ for player 2 against $\sigma^k$. (We
assume here that this can be done in strongly polynomial time. One way of doing it is to apply
the strategy iteration algorithm on a restricted game in which $\sigma^k$ is the only strategy available to
player 1.) Next, it evaluates the strategy profile $\pi^k = (\sigma^k, \tau^k)$, by solving a system of linear equations,
and obtains its value vector $v^k = v_{\pi^k}$. It then lets $\sigma^{k+1} = \mathcal{P}_1 v_{\pi^k}$. Ties are broken, if possible, in favor
of actions that are in $\sigma^k$. (This is important, as termination is not guaranteed without this provision.)
The algorithm terminates when two consecutive strategies $\sigma^k$ and $\sigma^{k+1}$ are identical.

The step $\sigma^{k+1} = \mathcal{P}_1 v_{\pi^k}$ is the main step of the strategy iteration algorithm. As we shall (implicitly)
see below, $\sigma^{k+1}$ is obtained from $\sigma^k$ by performing a collection of improving switches.
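
The iteration described above can be sketched in a few lines for a tiny game. This is an illustrative sketch under stated assumptions, not the paper's implementation: the game, the 0-based indexing and all names are hypothetical, and the optimal counter-strategy $\tau_2(\sigma)$ is found here by brute-force enumeration of player 2's strategies, which is only sensible for toy instances.

```python
from fractions import Fraction as F
from itertools import product

GAMMA = F(1, 2)  # assumed discount factor
# Hypothetical toy game: action = (state, cost, transition distribution).
ACTIONS = [
    (0, F(3), [F(1), F(0)]),
    (0, F(0), [F(0), F(1)]),
    (1, F(2), [F(0), F(1)]),
    (1, F(0), [F(1), F(0)]),
]
OWNER = [1, 2]  # OWNER[i] is the player controlling state i
N = 2


def solve(A, b):
    """Gauss-Jordan elimination over exact fractions."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        d = M[col][col]
        M[col] = [x / d for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def value(profile):
    """v_pi = (I - gamma P_pi)^{-1} c_pi; profile maps each state to an action."""
    A = [[(1 if i == j else 0) - GAMMA * ACTIONS[profile[i]][2][j] for j in range(N)]
         for i in range(N)]
    return solve(A, [ACTIONS[profile[i]][1] for i in range(N)])


def actions_of(i):
    return [a for a, (s, _, _) in enumerate(ACTIONS) if s == i]


def counter_strategy(sigma):
    """tau_2(sigma) by brute force (assumption: acceptable only for tiny games)."""
    s2 = [i for i in range(N) if OWNER[i] == 2]
    best = None
    for choice in product(*(actions_of(i) for i in s2)):
        profile = dict(sigma)
        profile.update(zip(s2, choice))
        v = value(profile)
        # An optimal counter-strategy maximizes every entry, hence also the sum.
        if best is None or sum(v) > sum(best[1]):
            best = (dict(zip(s2, choice)), v)
    return best


def strategy_iteration(sigma):
    """Return (final sigma, final tau, list of value vectors v^0, v^1, ...)."""
    values = []
    while True:
        tau, v = counter_strategy(sigma)
        values.append(v)
        new_sigma = {}
        for i in sigma:
            score = lambda a: ACTIONS[a][1] + GAMMA * sum(
                p * x for p, x in zip(ACTIONS[a][2], v))
            best = min(actions_of(i), key=score)
            # Ties are broken in favor of the action already in sigma.
            new_sigma[i] = sigma[i] if score(sigma[i]) == score(best) else best
        if new_sigma == sigma:
            return sigma, tau, values
        sigma = new_sigma


sigma, tau, values = strategy_iteration({0: 0})
print(sigma, tau, values)
```

A more faithful variant would compute the counter-strategy by running policy iteration on the restricted one-player game, as suggested in the text.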

To prove the correctness of the Strategy-Iteration algorithm we use the following lemma. (Note
that $\pi^1$ in the lemma is obtained from $\pi^0$ using one iteration of the Strategy-Iteration algorithm.)

Lemma 5.1. Let $\sigma^0 \in \Pi_1$, $\pi^0 = (\sigma^0, \tau_2(\sigma^0))$ and $\sigma^1 = \mathcal{P}_1 v_{\pi^0}$, $\pi^1 = (\sigma^1, \tau_2(\sigma^1))$. Then $v_{\pi^0} \ge v_{\pi^1}$.

Proof. We show that $v^{\pi^0}_{\pi^0} = 0 \ge v^{\pi^0}_{\pi^1}$, which by Lemma 3.6 implies that $v_{\pi^0} \ge v_{\pi^1}$. To show that
$v^{\pi^0}_{\pi^1} \le 0$, we show that $(c^{\pi^0})_{\pi^1} \le 0$. The fact that $(c^{\pi^0})_{\sigma^1} \le 0$ follows from the fact that for every
$i \in S_1$ we have $\sigma^1(i) = \operatorname{argmin}_{A_i} c^{\pi^0}$ and hence $(c^{\pi^0})_{\sigma^1(i)} \le (c^{\pi^0})_{\sigma^0(i)} = 0$. The fact that $(c^{\pi^0})_{\tau^1} \le 0$
follows from the fact that $\tau^0$ is an optimal counter strategy against $\sigma^0$, so in fact $(c^{\pi^0})_{A^2} \le 0$. □

Lemma 5.2. For every initial strategy $\sigma^0$, Strategy-Iteration($\sigma^0$) terminates after a finite number
of iterations. If $(v^k)_{k=0}^N$ is the sequence of value vectors generated by the call, then $v^{k-1} > v^k \ge v^*$,
for every $1 \le k < N$. Furthermore, $v^{N-1} = v^N = v^*$ and $\pi^{N-1} = \pi^N$ is an optimal strategy profile.

Proof. The claim $v^{k-1} \ge v^k$, for every $1 \le k \le N$, follows easily from Lemma 5.1 by induction. Next,
we note that if $v^{k-1} = v^k$, for some $k$, then by the reasoning used in the proof of Lemma 5.1 we
must have $(c^{\pi^{k-1}})_{A^1} \ge 0$ and $(c^{\pi^{k-1}})_{A^2} \le 0$. By the optimality condition, we get that $\pi^{k-1}$ is an
optimal strategy profile. By the tie breaking mechanism used, we also get that $\pi^k = \pi^{k-1}$. Finally,
the fact that $v^{k-1} > v^k$, for every $1 \le k < N$, implies that strategy profiles encountered cannot
repeat themselves. As there is only a finite number of such profiles, the sequence of strategy profiles
generated must be finite. □

We next relate the sequences of value vectors obtained by running Strategy-Iteration($\sigma^0$) and
Value-Iteration($v_{\pi^0}$, $\epsilon$), where $\pi^0 = (\sigma^0, \tau_2(\sigma^0))$. The following lemmas for the case of MDPs
are well-known and appear, e.g., in Meister and Holzbaur [27]. The proofs for the 2-player case are
essentially identical. (They may be folklore.)

Lemma 5.3. Let $\sigma^0 \in \Pi_1$, $\pi^0 = (\sigma^0, \tau_2(\sigma^0))$, and $\sigma^1 = \mathcal{P}_1 v_{\pi^0}$, $\pi^1 = (\sigma^1, \tau_2(\sigma^1))$. Then $T v_{\pi^0} \ge v_{\pi^1}$.

Proof. Let $i \in S_1$. As $\sigma^1(i) = \operatorname{argmin}_{A_i} (c + \gamma P v_{\pi^0})$, $v_{\pi^0} \ge v_{\pi^1}$, and $c_{\pi^1} + \gamma P_{\pi^1} v_{\pi^1} = v_{\pi^1}$, we have

$$(T v_{\pi^0})_i = \min_{A_i} (c + \gamma P v_{\pi^0}) = (c + \gamma P v_{\pi^0})_{\sigma^1(i)} \ge (c + \gamma P v_{\pi^1})_{\sigma^1(i)} = (v_{\pi^1})_i.$$

Similarly, if $i \in S_2$, then

$$(T v_{\pi^0})_i = \max_{A_i} (c + \gamma P v_{\pi^0}) \ge (c + \gamma P v_{\pi^0})_{\tau^1(i)} \ge (c + \gamma P v_{\pi^1})_{\tau^1(i)} = (v_{\pi^1})_i.$$ □

Using Lemma 5.3 we immediately get:

Lemma 5.4. Let $(v^k)_{k=0}^N$ be the value vectors generated by Strategy-Iteration($\sigma^0$), and let $(u^k)_{k=0}^\infty$
be the value vectors generated by Value-Iteration($v_{\pi^0}$, 0), where $\pi^0 = (\sigma^0, \tau_2(\sigma^0))$. Then, $v^k \le u^k$,
for every $0 \le k \le N$.

Proof. We prove the lemma by induction. We have $v^0 = u^0$. Suppose now that $v^k \le u^k$. Then, by
Lemma 5.3 and the monotonicity of the value iteration operator, we have:

$$v^{k+1} \le T v^k \le T u^k = u^{k+1}.$$ □

Combining Lemmas 4.8 and 5.4 we get

Lemma 5.5. Let $(v^k)_{k=0}^N$ be the sequence of value vectors generated by Strategy-Iteration($\sigma^0$),
for some $\sigma^0 \in \Pi_1$. Let $v^*$ be the optimal value vector. Then, for every $0 \le k \le N$ we have

$$\|v^k - v^*\|_\infty \le \gamma^k \|v^0 - v^*\|_\infty.$$

6 Strongly polynomial bound 

In this section, the main section of the paper, we present our strongly polynomial bound on the number 
of iterations performed by the strategy iteration algorithm. We begin with some technical lemmas. 

Lemma 6.1. Let $\pi', \pi$ be two strategy profiles such that $v_{\pi'} \ge v_\pi$ and let $a = \pi'(i)$ where $i \in S$. Then,

$$(v_{\pi'} - v_\pi)_i \ge (c^\pi)_a.$$

Proof. $(v_{\pi'})_i - (v_\pi)_i = (c + \gamma P v_{\pi'})_a - (v_\pi)_i \ge (c + \gamma P v_\pi)_a - (v_\pi)_i = (c^\pi)_a$. □

Lemma 6.2. Let $\pi'', \pi$ be two strategy profiles such that $v_{\pi''} \ge v_\pi$ and let $a = \operatorname{argmax}_{\pi''} c^\pi$. Then,

$$\|v_{\pi''} - v_\pi\|_1 \le \frac{n}{1-\gamma} (c^\pi)_a.$$

Proof. As $v_{\pi''} \ge v_\pi$, we get using Lemma 3.12 and then Lemma 3.10 that

$$\|v_{\pi''} - v_\pi\|_1 = e^T (v_{\pi''} - v_\pi) = x_{\pi''} (c^\pi)_{\pi''} \le x_{\pi''} e \, (c^\pi)_a = \frac{n}{1-\gamma} (c^\pi)_a.$$ □

Lemma 6.3. Let $\pi'', \pi', \pi$ be three strategy profiles such that $v_{\pi''} \ge v_{\pi'} \ge v_\pi$. Let $a = \operatorname{argmax}_{\pi''} c^\pi$
and suppose that $a \in \pi'$. Then,

$$\|v_{\pi'} - v_\pi\|_1 \ge \frac{1-\gamma}{n} \|v_{\pi''} - v_\pi\|_1.$$

Proof. Let $i \in S$ be the state for which $\pi''(i) = \pi'(i) = a$. By Lemma 6.1 and Lemma 6.2 we get

$$\|v_{\pi'} - v_\pi\|_1 \ge (v_{\pi'} - v_\pi)_i \ge (c^\pi)_a \ge \frac{1-\gamma}{n} \|v_{\pi''} - v_\pi\|_1.$$ □

Lemma 6.4. Let $(\sigma^k)_{k=0}^N$ be the sequence of player 1 strategies generated by the Strategy-Iteration
algorithm, starting from some initial strategy $\sigma^0$. Let $L = \log_{1/\gamma} \frac{n^2}{1-\gamma}$. Then, every strategy $\sigma^k$ contains
an action that does not appear in any strategy $\sigma^\ell$, where $k + L < \ell \le N$.

Proof. Let $(\pi^k)_{k=0}^N$, where $\pi^k = (\sigma^k, \tau^k)$, be the sequence of strategy profiles generated by the strategy
iteration algorithm. By the correctness of the strategy iteration algorithm, $\pi^* = \pi^N$ is composed of
optimal strategies for the two players. Let $a = \operatorname{argmax}_{\pi^k} c^{\pi^*}$. By Lemma 3.7 we have $(c^{\pi^*})_a \ge 0$
for every $a \in A^1$, and $(c^{\pi^*})_a \le 0$ for every $a \in A^2$. We may assume, therefore, that $a \in A^1$, i.e.,
that $a$ is an action controlled by player 1. Suppose, for the sake of contradiction, that $a \in \pi^\ell$, for some
$k + L < \ell \le N$. Using Lemma 6.3 with $\pi'' = \pi^k$, $\pi' = \pi^\ell$ and $\pi = \pi^*$, we get that

$$\|v_{\pi^\ell} - v_{\pi^*}\|_1 \ge \frac{1-\gamma}{n} \|v_{\pi^k} - v_{\pi^*}\|_1.$$

On the other hand, using Lemma 5.5 we get that

$$\|v_{\pi^\ell} - v_{\pi^*}\|_\infty \le \gamma^{\ell-k} \|v_{\pi^k} - v_{\pi^*}\|_\infty.$$

Thus,

$$\|v_{\pi^\ell} - v_{\pi^*}\|_1 \le n \|v_{\pi^\ell} - v_{\pi^*}\|_\infty \le n \gamma^{\ell-k} \|v_{\pi^k} - v_{\pi^*}\|_\infty \le n \gamma^{\ell-k} \|v_{\pi^k} - v_{\pi^*}\|_1.$$

It follows that $n \gamma^{\ell-k} \ge \frac{1-\gamma}{n}$ and hence

$$\gamma^L > \gamma^{\ell-k} \ge \frac{1-\gamma}{n^2},$$

a contradiction, since $\gamma^L = \frac{1-\gamma}{n^2}$ by the choice of $L$. □
Theorem 6.5. The Strategy-Iteration algorithm, starting from any initial strategy, terminates
with an optimal strategy after at most $(m+1)(1 + \log_{1/\gamma} \frac{n^2}{1-\gamma}) = O(\frac{m}{1-\gamma} \log \frac{n}{1-\gamma})$ iterations.

Proof. Let $\bar{L} = \lfloor 1 + \log_{1/\gamma} \frac{n^2}{1-\gamma} \rfloor$. Consider strategies $\sigma^0, \sigma^{\bar{L}}, \sigma^{2\bar{L}}, \ldots$. By Lemma 6.4 every strategy in
this subsequence contains a new action that would never be used again. As there are only $m$ actions,
the total number of strategies in the sequence is at most $(m+1)\bar{L} \le (m+1)(1 + \log_{1/\gamma} \frac{n^2}{1-\gamma})$. Finally,
note that $\log_{1/\gamma} x = \frac{\log x}{\log(1/\gamma)} \le \frac{\log x}{1-\gamma}$. □
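
To get a feel for the size of the bound in Theorem 6.5, the following sketch evaluates it numerically and compares it with the simplified form obtained from $\log_{1/\gamma} x \le \frac{\log x}{1-\gamma}$. The function names and the sample parameters are assumptions chosen for illustration.

```python
import math


def iteration_bound(n, m, gamma):
    """(m + 1) * (1 + log_{1/gamma}(n^2 / (1 - gamma))), the bound of Theorem 6.5."""
    return (m + 1) * (1 + math.log(n * n / (1 - gamma)) / math.log(1 / gamma))


def simplified_bound(n, m, gamma):
    """Upper bound on the above, using log_{1/gamma} x <= log(x) / (1 - gamma)."""
    return (m + 1) * (1 + math.log(n * n / (1 - gamma)) / (1 - gamma))


# Sample (assumed) sizes: n = 100 states, m = 1000 actions.
for gamma in (0.5, 0.9, 0.99):
    print(gamma, iteration_bound(100, 1000, gamma), simplified_bound(100, 1000, gamma))
```

For a fixed number of states and actions, both quantities grow roughly like $\frac{1}{1-\gamma}$ as $\gamma$ approaches 1, which is why the bound is strongly polynomial only for a constant discount factor.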

7 Concluding remarks 

We have shown that the strategy iteration algorithm is strongly polynomial for 2TBSGs with a fixed 
discount factor. Friedmann [14], on the other hand, has recently shown that the strategy iteration
algorithm is exponential for non-discounted 2TBSGs, or when the discount factor is part of the input.

The existence of polynomial time algorithms for 2TBSGs when the discount factor is part of the 
input, or for the non-discounted case, remains an intriguing and a challenging open problem, with 
many possible consequences for complexity theory and automatic verification. As shown by Andersson 
and Miltersen [1], this is equivalent to finding a polynomial time algorithm for Condon's [6] Simple
Stochastic Games (SSGs). Such an algorithm will immediately provide polynomial time algorithms
for Mean Payoff Games (MPGs) (see [10], [15], [36]) and Parity Games (PGs) (see, e.g., [11], [33], [19]).

We believe that our results give some hope of obtaining a polynomial time algorithm for this problem. 
In an earlier work, Ye [34] gave a polynomial time algorithm for the analogous MDP problem. His 
algorithm uses interior point methods and its analysis relies again on the LP formulation of the MDP 
problem. Given the "deLPfication" of Ye's [35] analysis of the policy iteration algorithm carried out
here, one could speculate that looking at interior point methods for the two-player case, with Ye's 
algorithm for MDPs as a starting point, would be a fertile approach. 